---
id: tutorial-evaluations-catching-regressions
title: Catching LLM Regressions
sidebar_label: Catching LLM Regressions
---

In this section, you'll learn how to **compare evaluation results** for the same set of test cases, enabling you to identify improvements and regressions in your LLM performance.

:::tip
Detecting regressions is crucial as they reveal areas where your LLM's **performance has unexpectedly declined**.
:::

## Quick Recap

In the previous section, we updated our medical chatbot's model, temperature, and prompt template settings, and re-evaluated them on the same test cases and metrics, which resulted in the following report:

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_evaluation_11.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

We found that while all previously failing test cases now pass, one test case has regressed. The first test case, which previously achieved near-perfect scores, is now failing faithfulness and professionalism. This means that we've accidentally introduced a breaking change, and being able to catch these regressions is a vital part of a robust evaluation pipeline.

:::note
While regressions are scary, it’s equally important to be evaluating improvements. Examining specific test cases where scores have increased and understanding the reasons behind these changes ensures that any improvements align with our desired outcomes, and Confident AI provides an easy and simple way to **compare evaluation results** to identify both improvements and regressions.
:::

## Comparing Evaluations

To compare two evaluations, navigate to the **Comparing Test Runs** page, located as the third tab on the left navigation bar. Then, select the test run ID of the evaluation results you want to compare with your new results.

:::info  
A **test run on Confident AI** represents a single evaluation of a collection of test cases using a defined set of metrics. When you run `evaluate`, a test run is created on Confident AI, along with the testing report.
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_evaluation_15.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

Once you select the test run to compare with, Confident AI will automatically align the test cases and visually highlight the differences — improvements are marked as green rows, while regressions are shown in red.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_evaluation_16.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

:::info  
Confident AI matches test cases based on the `input` of each `LLMTestCase`. If no matching test cases are found, no comparisons will be displayed.  
:::

You can analyze each test case further by clicking on it to inspect individual regressing and improving metric scores. For instance, test cases 2, 4, and 5 show significant improvements in previously failing metrics, with their updated outputs aligning with our expectations.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_evaluation_13.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

Let’s take a closer look at the regressing test case:

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_evaluation_14.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

Here, we observe that **introducing additional flexibility** into the prompt template may have inadvertently caused some confusion during the generation process. As a result, the chatbot appears uncertain about whether to proceed with diagnosing the patient or to request further details, and ultimately fails to meet the standards for professionalism and faithfulness.

:::note  
Increasing the complexity of your prompt template can make it harder for an LLM to process queries effectively. **Upgrading the LLM model** is one way to address this challenge.  
:::

## Improving Your Model

We'll try to fix this by using a better model. The model hyperparameter will be upgraded to use `gpt-4o` instead of the original `gpt-3.5` model, and we can re-evaluate our LLM application one last time to verify the results.

```python
...
MODEL="gpt-4o"

evaluate(
  dataset,
  metrics = [answer_relevancy_metric, faithfulness_metric, professionalism_metric],
  hyperparameters={"model": MODEL, "prompt template": SYSTEM_PROMPT, "temperature": TEMPERATURE}
)
```

:::tip
Don't forget to actually use the `gpt-4o` model and not just log it for improved results!
:::

## Analyzing The Improved Chatbot

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_evaluation_17.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

We've finally managed to pass all test cases! After multiple iterations, our medical chatbot has successfully passed all the test cases, despite initially failing the majority of them.

:::note
This seems too good to be true, and it is. The reason why we managed to get everything right so quickly is because we've only evaluated 5 LLM outputs, which is far from a comprehensive evaluation dataset.
To ensure higher test coverage, you'll need a larger and more comprehensive evaluation dataset. Such a dataset should include challenging scenarios and edge cases to rigorously test your model's capabilities. While you could manually curate this dataset, doing so can be both time-intensive and expensive.
:::

In the next section, we'll dive into how you can **generate synthetic data** using `deepeval` to efficiently scale the evaluation of your LLM application.
